gh-148762: Speed up multiline regexes anchored by ^ #152339

Open
haampie wants to merge 4 commits into python:main from haampie:hs/fix/multiline-caret

@haampie haampie commented Jun 26, 2026

Multiline regexes of the form re.compile("^foo", re.MULTILINE) currently
fall into the generic search loop, which calls SRE(match) at every
position in the subject string. Since a ^-anchored (SRE_AT_BEGINNING_LINE)
pattern can only match at the start of the string or right after a linebreak,
we can instead jump from one line start to the next, skipping all the
intermediate positions.
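
The difference between the two strategies can be illustrated outside C. What follows is a rough Python sketch, not the code in `sre_lib.h`: `naive_search` and `line_start_search` are hypothetical helpers invented for this illustration, and the sketch assumes the search starts at index 0 (the real change also has to handle a `pos` that lands mid-line). Both helpers count how many match attempts they make.

```python
import re


def naive_search(pattern, s):
    """Hypothetical helper: try a match at every position, like a generic search loop."""
    attempts = 0
    for pos in range(len(s) + 1):
        attempts += 1
        m = pattern.match(s, pos)
        if m:
            return m.span(), attempts
    return None, attempts


def line_start_search(pattern, s):
    """Hypothetical helper: only try positions where a multiline ^ can succeed."""
    attempts = 0
    pos = 0  # index 0 is always a line start
    while True:
        attempts += 1
        m = pattern.match(s, pos)
        if m:
            return m.span(), attempts
        # Jump to the character right after the next "\n", if there is one.
        nl = s.find("\n", pos)
        if nl == -1:
            return None, attempts
        pos = nl + 1


if __name__ == "__main__":
    p = re.compile("^foo", re.MULTILINE)
    subject = "bar\nbaz\nfoo\n"
    print(naive_search(p, subject))       # ((8, 11), 9)
    print(line_start_search(p, subject))  # ((8, 11), 3)
```

On this subject both helpers find the same span, but the line-start variant makes 3 match attempts instead of 9; the gap grows with line length.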

Benchmarks show good improvements in runtime across UCS-1/2/4; full
numbers are in the issue.
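
For readers who want to reproduce such a measurement locally, here is a minimal timing sketch. It is an assumption about how one might benchmark this, not the script used for the numbers in the issue; `time_caret_search` and its parameters are hypothetical. The filler character selects the string kind (UCS-1, UCS-2, or UCS-4), and the subject never matches, so the search has to scan every line.

```python
import re
import timeit


def time_caret_search(filler, lines=10_000, width=80, repeat=5):
    """Hypothetical helper: best-of-`repeat` seconds for one failing '^foo' search."""
    subject = "\n".join([filler * width] * lines)
    pattern = re.compile("^foo", re.MULTILINE)
    # Worst case for the search loop: no line starts with "foo".
    assert pattern.search(subject) is None
    timings = timeit.repeat(lambda: pattern.search(subject), number=1, repeat=repeat)
    return min(timings)


if __name__ == "__main__":
    kinds = [("UCS-1", "x"), ("UCS-2", "\u0100"), ("UCS-4", "\U0001F600")]
    for kind, filler in kinds:
        print(f"{kind}: {time_caret_search(filler):.6f} s")
```

Running the same script on builds with and without the patch gives a before/after comparison per string kind.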

haampie added 3 commits June 26, 2026 19:53
Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>
Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>
Signed-off-by: Harmen Stoppels <harmenstoppels@gmail.com>

@eendebakpt eendebakpt left a comment

Claude suggests adding some tests for coverage. I'm not sure we need all of them, but I'm including them in the details block for reference.

<details>

```
def test_search_anchor_at_beginning_line(self):
    # gh-148762: a multiline "^" search jumps between line starts. These
    # cases pin the behaviour the optimization must preserve.
    for pattern, cases in [
        ('^', [
            ('', [(0, 0)]),
            ('abc', [(0, 0)]),
            ('\n', [(0, 0), (1, 1)]),
            ('\n\n', [(0, 0), (1, 1), (2, 2)]),
            ('a\n', [(0, 0), (2, 2)]),  # match at end after \n
            ('\na', [(0, 0), (1, 1)]),
            ('a\nb\nc', [(0, 0), (2, 2), (4, 4)]),
            ('a\n\nb', [(0, 0), (2, 2), (3, 3)]),  # empty line
            ('\n\n\n', [(0, 0), (1, 1), (2, 2), (3, 3)]),
        ]),
        ('^a', [
            ('a', [(0, 1)]),
            ('a\na', [(0, 1), (2, 3)]),
            ('a\nba\na', [(0, 1), (5, 6)]),
            ('ba\nab', [(3, 4)]),
            ('a\n', [(0, 1)]),  # no match-at-end: needs 'a'
            ('\na', [(1, 2)]),
            ('aa\naa', [(0, 1), (3, 4)]),
            ('a\n\na', [(0, 1), (3, 4)]),
            ('a\nĀa\na', [(0, 1), (5, 6)]),  # UCS2 string kind
            ('Ā\na\nĀ', [(2, 3)]),
            ('a\n\U0001F600a\na', [(0, 1), (5, 6)]),  # UCS4 string kind
            ('\U0001F600\na', [(2, 3)]),
        ]),
    ]:
        p = re.compile(pattern, re.MULTILINE)
        for s, expected in cases:
            with self.subTest(pattern=pattern, string=s):
                self.assertEqual([m.span() for m in p.finditer(s)], expected)

    # bytes (8-bit) path
    pb = re.compile(b'^a', re.MULTILINE)
    for s, expected in [(b'a\nba\na', [(0, 1), (5, 6)]), (b'a\n', [(0, 1)]),
                        (b'\na', [(1, 2)]), (b'abc', [(0, 1)])]:
        with self.subTest(string=s):
            self.assertEqual([m.span() for m in pb.finditer(s)], expected)

    # pos / endpos: the search may begin mid-line or on a line start
    pa = re.compile('^a', re.MULTILINE)
    self.assertEqual([m.span() for m in pa.finditer('xa\na', 1)], [(3, 4)])
    self.assertEqual([m.span() for m in pa.finditer('a\na', 2)], [(2, 3)])
    self.assertEqual([m.span() for m in pa.finditer('a\na\na', 1, 3)], [(2, 3)])
    self.assertEqual([m.span() for m in pa.finditer('a\na', 0, 1)], [(0, 1)])

    # sub / subn / split also drive search()
    pc = re.compile('^', re.MULTILINE)
    self.assertEqual(pc.sub('#', 'a\nb\nc'), '#a\n#b\n#c')
    self.assertEqual(pc.sub('#', 'a\nb\n'), '#a\n#b\n#')
    self.assertEqual(pc.subn('#', 'a\nb\n'), ('#a\n#b\n#', 3))
    self.assertEqual(pc.split('a\nb'), ['', 'a\n', 'b'])
    self.assertEqual(pc.split('a\nb\n'), ['', 'a\n', 'b\n', ''])
```

</details>

Comment thread Modules/_sre/sre_lib.h
Comment on lines +1864 to +1867
    while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
        ptr++;
    if (ptr >= end)
        return 0;

This could be

Suggested change
    - while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
    -     ptr++;
    - if (ptr >= end)
    -     return 0;
    + #if SIZEOF_SRE_CHAR == 1
    + ptr = memchr(ptr, '\n', end - ptr);
    + if (ptr == NULL)
    +     return 0;
    + #else
    + while (ptr < end && !SRE_IS_LINEBREAK(*ptr))
    +     ptr++;
    + if (ptr >= end)
    +     return 0;
    + #endif

(I did not benchmark, not sure it is worth the change)
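
Whatever the benchmark outcome, the two branches have to agree on where the next line break is. As a loose Python analogue of that equivalence (an illustration only, not a test of the C code), the sketch below compares a byte-by-byte loop with `bytes.find`; both function names are hypothetical.

```python
def next_linebreak_loop(data: bytes, start: int) -> int:
    """Hypothetical helper: scan byte by byte, like the existing while loop."""
    i = start
    while i < len(data) and data[i] != 0x0A:  # 0x0A is "\n"
        i += 1
    return i if i < len(data) else -1


def next_linebreak_find(data: bytes, start: int) -> int:
    """Hypothetical helper: delegate to a library search, like the memchr branch."""
    return data.find(b"\n", start)


if __name__ == "__main__":
    for sample in [b"", b"abc", b"a\nb", b"\n", b"abc\n"]:
        for start in range(len(sample) + 1):
            assert next_linebreak_loop(sample, start) == next_linebreak_find(sample, start)
    print("both strategies agree")
```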
